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Abstract 

Data-based classification is fundamental to most branches of science. While recent years 
have brought enormous progress in various areas of statistical computing and clustering, 
some general challenges in clustering remain: model selection, robustness, and scalability 
to large datasets. We consider the important problem of deciding on the optimal number of 
clusters, given an arbitrary definition of space and clusteriness. We show how to construct 
a cluster information criterion that allows objective model selection. Differing from other 
approaches, our truecluster method does not require specific assumptions about under- 
lying distributions, dissimilarity definitions or cluster models. Truecluster puts arbitrary 
clustering algorithms into a generic unified (sampling-based) statistical framework. It is 
scalable to big datasets and provides robust cluster assignments and case-wise diagnostics. 
Truecluster will make clustering more objective, allows for automation, and will save time 
and costs. Free R software is available. 

Keywords: bagging, clustering, truecluster, MMCC, CIC 
1. Introduction 

The power of modern computers has revolutionized the way we do statistical analysis. 
Computer-intensive simulation met hods, such as calculation of standa r d erro rs via boot- 
strapping in frequentist statistics |ifron1 . [l97gl : lEfron and TibsWil S) or MCMC 
methods in Bayesian statistics, have become increasingly important in order to address 
the two big themes in statistical data mining: 'prediction' and 'clustering'. An important 
problem in both areas is model selection: how to find a statistical model optimally adapted 
to the true patterns in a data population, while avoiding overfit to the sample? 

In predictive modeling, early approaches tried to penalize too much flexibility by sub- 
tracting the p a ramet ric model's degrees o f freed om from the log-likelihood, namely AIC 
(jAkaikel . Il97.4 Il974l ) and BIC (jSchwarzl . Il978l ). Other approaches made use of cross- 
validation or boot s trapp ing with shrinking in order to estimate and correct the amount of 
overfit, cf. Harrelll ( 2001 ). Recent scalable methods combine resampling and model averag- 
ing: they optimize mo del complexity, minimize overfit, an d result in very r obust predictions, 
for example, bagging (jBreimanl . 1996) or random forests (jBreimanl . l200ll ) . 

Contrary to these advances in predictive modeling, cluster analysis has not yet reached 
the same maturity in scalability, robust ness, and model selection: while a multitude of 



powerful algorithms h as been developed (IHalkidi et al.l . |2001| ; iBerkhinl . 120021 : IZaiane et al. 



20021 ; I Jain et all 12004 ) . the field lacks a coherent statistical framework for model selection: 



up to now the decision for an optimal number of clusters within a single model class has not 
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been generically solved, not to mention comparisons across model classes. Identifying the 
correct number of clusters is of great practical imp ortance and — as a genera l prob l em — has 
not b een conside red solved for more t han 50 years dThorndikel. Il95.4 lEverittJ. Il97fll: iGordonl . 
19991 . chapt. 3.5; lEveritt et all . bOQll . chapt. 5.5; iGordon and Vichil . bOQll ; iDolnicari . I2003I ). 



We present here a generic statistical framework for the selection of the optimal number 
of clusters within a single model class. We finally derived the framework under three major 
restrictions: the method must be scalable to large datasets, must deliver robust cluster 
results, and must provide useful case-wise diagnostics. Furthermore, we were looking for 
an approach that can easily be im plemented without proprietary software, preferably in R 



R Development Core Team! 120031 ) in order to make it available for research and application 
Milligan , 119961 ) . We organize the paper as follows: in Section [21 we define the problem and 



review related work, in Section El we introduce the truecluster framework, in Section IH we 
briefly summarize the truecluster matching and voting scheme, in Section[5l we introduce the 
cluster information criterion CIC and then discuss scalability in Section[6j In Section^ we 
illustrate the truecluster method applied to a real world example and, finally, in Section[8l we 
discuss the benefits and restrictions of truecluster. Appendix A shows trueclu ster results for 
some illustrative artificial datasets. Truecluster software is work-in-progress (jOehlschlagell . 
2007al ) and currently offers truecluster matching (see ?matchindex) and calculation of the 



cluster information criterion (see ?CIC) 



2. Problem definition and related work 

We consider the following general setup for cluster analysis: N cases in an M-dimensional 
feature space sampled from an infinite population will be classified into an optimal number 
of K distinct clusters, given some definition of 'clusteriness' represented by a base-cluster 
algorithm that takes K as input parameter. The number of possibilities to assign cases to 
clusters grows exponentially with the number of cases. Checking all the possibilities for an 
optimal solution is prohibitive even with tomorrow's computers. This is one reason for the 
existence of so many cluster algorithms — and the need to validate the resulting models. 

Our problem definition excludes base cluster algorithms that do not allow the num- 
ber of clus ters to be specified in adva nce, for example, the k-th nearest neighbor selection 
method by Wong and Schaack ( 19821 ). Such algorithms deliver an automatically emerging 
number of clusters, but usually require subjective choice of other — often continuous — input 
parameters that are even more difficult to optimize. For obvious reasons, we also exclude 
well-defined decison problems (and algorit hms), wh e re the optimal number of clusters fol- 
lows from minimizing a loss-function, see iMacKav (|200i chapter 28, m odel comparison, 
and O ccam's razor). For example, the affinity propagation algorithm of iFrey and Dueck 
(I2007T ). which automatically delivers an optimal set of 'exemplars', is in fact a method for 
cost optimization: both inputs, objective 'similarities' and subjective 'preferences', can be 
interpreted as (inverse) costs. 

We briefly summarize some of the existing approaches to cluster validation and model 
selection and point out important restrictions for each of them. We do not consider methods 
here that are designed to only work with one specific cluster alg orithm such as the maximum 
span ning tree stopping rule for single linkage agglomeration (IKrolak-Schwerdt and EckesI . 
1 19921 ). 
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Approaches that calculate a goodness-of-fi t (GOF) criteri on with respect to a definition 
of clusteriness, often termed 'int ernal criteria ' ( Milligan . 198ll ). are the most widespread. An 
example is the silhouette width (IRousseeuwI . 119871 ) which evaluates the quality of separation 
of convex pa rtitions (with respect to cas e dissimilarity). The 30 cluster evaluation indices 



compar ed by lMilligan and Cooperi (119851 ) and the 15 indices compared by lDimitriadou et al 



(|2002al ) are based on the GOF approach: cluster solutions are evaluated in the N x M 
feature space or in the N x N dissimilarity matrix. Evaluating GOF has its justification in 
the fact that most cluster algorithms do not guarantee finding an optimal clus t ering and are 
often sensitive to outliers, for example, the widespread k- means (jMacQueenl . Il967l ). Since 
each GOF index relates to a spec i fic de finition of clusteriness, 'no single superior procedure 
can be recommended' ( Dolnicar . 20031 ) as a general method for deciding on the optimal 
number of clusters. Furthermore, using GOF as a decision method is complicated by the 
fact that GOF is often biased with t he number of clusters and decision makers are asked to 
identify a 'knee' (jHalkidi et al.l . l200ll . p. 129) in a plot of GOF versus the number of clusters. 
In order to w ork around such bi as, some GOF indices su ch as the cubic clustering criterion 
(jSarlel . ll983h or the gap statistic (jTibshirani et all . l2001al ) relate GOF to the expected GOF 
under a null hypothesis, see below. 

More systematic approaches to decisions about the optimal number of clu sters have been 
developed in the context of parametr ic probabilis tic models. Smyth ( 19961) distinguished 
three approaches: hypothesis testing (Bock . 1996 ), f ull Ba yesian analysis ( MacKay . 20031 . 
part IV) such as AutoClass dCheeseman and Stuta. Il996l). and penalized likelihood suc h 
as BIC dBanfield and Raftervl . 1199.4 iKass and Raftervl . Il99.4 iFralev and Eaftervl . Il998l ). 
Smyth concludes that 'In theory, the full Bayesian approach is fully optimal and probably 
the most useful of the three methods listed above. However, in practice it is cumbersome to 
implement, it is not necessarily straightforward to extend to non-Gaussian problems with 
dependent samples, and the results will be dependent in a non-transparent manner, on the 
quality of the underlying approxi mations and s imulations. Thus, there is certainly room 
for exploring alternative models.' (Smyth, 19961 . p. 127). For example, the EM-algorithm 
underlying the above mentioned BIC does not scale easily to big samples. Smyth suggests 
Monte Carlo cr oss validation which he fou n d per formed as well as AutoClass and better 
than the BIC. Chickering and Heckerman ( 19971 ) found the BIC to work reasonably for 
model selection and confi rmed superiority of AutoClass over BIC for model averaging. 
Dimitriadou et al.l (|2002al ) also reported problems with the BIC in latent class analysis. 
In summary, for parametric probabilistic models, acceptable model selection methods are 
available. Still, correctness of these methods rely on correct parametric assumptions and 
the methods don't easily translate to non-parametric definitions of clusteriness. 

The most general approach to cluster validation is given when feature space or dissimi- 
larity space are completely ignored and we just compare agreement of cluster results from 
several 'disturbed' solutions, for example, solutions from bootstrapping, cross-validation 
or feature sampling. Transferring ideas from prediction model validation, several authors 
have stressed the i mportance of validating independent (non-over l apping ) sub-samples in 
order to avoid bias ( Dudoit and Fridlyand . 20021 : Tibshirani et al. . 2001bl ). However, even 
independent resampling schemes do not achieve true independence with respect to clus- 
tering: spatial neighbors have a higher likelihood of being clustered together. Th erefore, 
even 'random-corrected' agreement indices, such as Cohen's kappa (Cohen, 1960l ) or the 
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random-corrected version ([Hubert and Arabic! . Il985l . crand) of the rand index (|Randl . ll97lh . 
will show 'non-random' agreement in n on-clustered random data. In order to work around 
such bias, iDudoit and Fridlvandl (j2002h have suggested relating cross-validation agreement 
to performance of the same agreement index (and the same base cluster algorithm) in 
simulating from a reference null distribution. 

We have seen that the difficulties with the GOF approach and cross- validated agreement 
indices have led to suggest resorting to assuming (or simulating from) a reference null 
distribution: the null hypothesis of random clustering. Such a null distribution is supposed 
to represent a neutral no-cluster situation; however, the choice of a null distribution is 
subjective and can influence the results. Take a 2-dimensional uniform random distribution: 
the square shape will induce artificial agreement for a 4-cluster k-means solution, something 
that will be different using a multivariate normal null distribution. Therefore, we are 
looking for an approach that treats the base cluster algorithm as a black box, works with 
any definition of clusteriness, and requires no assumptions about null-distributions, variable 
space or dissimilarity definitions. 



3. Truecluster 

The first idea for truecluster dates back to 1996 when we tried to select the best number 
of clusters for a given base-cluster algorithm by evaluating the stability of each K-cluster 
solution via repeated split-hal f cross-validat ion. In 1997 we got to know the draft of Harrell's 



book and S software library (H arrelll . |2001| ) that suggests bootstrapping for the validation 
and calibration of regression models. We were fascinated by the idea of comparing the 
stability of cluster models fitted to resamples of the same sizes as the original sample size, 
since in split-half samples stability could be biased downward. On the other hand, Harrell's 
approach involved comparisons between models built on overlapping data and, thus, could 
estimate stability biased upward. We experimented intensively with both approaches and 
came to the conclusion that both are biased. Surprisingly, we found a second source for 
a reduced split-half stability, due to outliers that — systematically — are in one but not in 



the o ther split-half sample. Then we learned about Breiman's work on bagging (jBreiman 



l~996h and experimented with aggregating K x K agreement counts (or agreement statistics 



based on them) between many pairs of cluster solutions. In order to create K x K cluster 
agreement counts we would either need overlapping samples as in bootstrapping or — in the 
case of split-half — to assign the out-of-resample cases to the clusters found in resampling. 
We concluded that the K x K matrices don't contain enough information for proper model 
selection. Consequently, we turned to aggregating the cluster assigments themselves in a N 
x K matrix in order to creat e a cluster version of bagging , very similar — but not equal — to 
the BagClustl algorithm by Dudoit and Fridlvandl ( 20031 ). An aggregated N x K matrix 



would not only promise to contain enough information for model selection but also allow 
the creation of new, more robust cluster assignments with case-wise diagnostics and to scale 
expensive base cluster algorithms to larger samples. We decided not t o go for aggregation of 



the ev en more informative JVxJV co-occurence counts as suggested bv lDudoit and Fridlvand 



(2003, BagClust2) because this requires 0(A^ 2 ) space complexity. Given this decision, the 



truecluster approach involves two steps described in the next two sections: 
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1. For each number of clusters K, aggregate the results of many resamples in an N 
x K matrix Ck using a specific matching and voting logic called multiple match 
cluster counts MMCC and convert to an N x K matrix that contains estimated 
probabilities P^: how likely it is for each case i € {1...N} to be assigned to cluster 
k € {1...K} across many resamples (see Section H]). 

2. For each number of clusters K, calculate a cluster information criterion CICk using 

as input. Choosing the model with the highest CICk gives the optimal number 
of clusters K (see Section [5]). 

For simplicity, we drop the index K from all the following notation. 



4. Multiple match cluster counts (MMCC) 

Dudoit and Fridlvandl feooal ) initialize BagClustl by applying the base cluster algorithm to 
the complete sample and use this as a reference for permutating the labels of the bootstrap 
clusterings that are to be aggregated in C. Because some base cluster algorithms don't 
scale to arbitrarily large samples this choice is not generally applicable. Aggregating votes 
from subsampling can help, but one needs sufficient overlap for permutation of the sub- 
sample cluster labels. Complete overlap can be achieved by assigning the out-of-resample 
cases to the resample clusters. Such prediction is often computationally less expensive com- 
pared to the base cluster algorithm. The resulting partially-predicted full-sample solutions 
can be used to initialize C and for the subsequent voting. Unlike BagClustl, no single 
solution serves as a reference for label permutation, instead C itself is used, because C 
becomes a better c luster r epresentation with ongoing voting, similar to the suggestion by 
Dimitriadou et al.1 (|2002bh . C (and P) can be interpreted as a fuzzy cluster solution and 
can aggregate results from fuzzy base cluster algorithms, however, both (fuzzy resample 
solutions and the C reference) need to be converted to a crisp cl ustering before doing the 



label permutation . This differs from fuzzy consensus clustering (| Gordon and Vichil . 12001 



Dimitriadou et all l2002bh and guarantees that P can be given a probability interpretation 
which is crucial for the CIC evaluation. 

While consensus cluster methods for crisp dStrehl and Ghosh! . 120021 ) or fuzzy cluster 



ensembles (IGordon and Vichil . l200ll : iDimitriadou et all l2002bl ) aim to minimize euclidean 



distances between cluster representations in the ensemble, the label permutation in MMCC 
uses a different matching criterion. The reason for the difference is that consensus clustering 
tries to find the deterministic optimal representation of a finite cluster ensemble, whereas 
MMCC is a probabilistic algorithm trying to converge an appropriate representation C 
for a single c l uster m odel. Details of truecluster ma tching are given in a separate paper 
(jOehlschlagell . l2007bh and free software is available (jOehlschlagell . l2007al ). The standard 
MMCC algorithm can now be described as follows: 

1. Create a N x K matrix C and initialize each cell Cj & with zero. 

2. Take a resample (with replacement) of size n, use a base cluster algorithm to fit the K- 
cluster model c* to the resample. Then use a suitable prediction method to determine 
cluster membership of the out-of-resample cases to get a complete cluster vector c 
with N elements c„- . 
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3. For each row in C, add one vote (add 1) to the column corresponding to the cluster 
membership in c . 

4. Repeat step 2. 

5. Estimate cluster memberships c by a row- wise majo rity count in C (brea king ties at 
random), use the truematch algorithm or heuristic ( Oehlschlagel . 2007bl ) to align c 



with c, and rename the clusters in c like the corresponding clusters in c . 

6. For each row in C, add one vote (add 1) to the column corresponding to the cluster 
membership in c . 

7. Repeat from step 4 until some reasonable convergence criterion is reached. 

8. Divide each cell in C by its row-sum to get a matrix of estimated cluster membership 
probabilities P . 

Remark 1: Resampling with replacement was chosen because this can be applied to 
samples of any size and reflects the usual assumption that the sample stems from an infinite 
population. With this choice, special care is needed to avoid difficulties with duplicate cases, 
for example, duplicated initial centers with the k-means base algorithm. A finite-population 
setting, a very large sample size or a base cluster algorithm's intolerance to duplicated values 
might justify a different sampling scheme. 

Remark 2: When a base cluster algorithm and a prediction method are computationally 
expensive and scaled to very large samples, a variation of MMCC might scale better: 
subdivide the sample into sufficientl y overlapping subsample s and integrate these to get an 
initial C, similar to sugg estions bv Istrehl and Ghoshl (j2002h . Then match the subsample 



solutions without prediction and vote only for the cases in the subsamples. Row-sums of C 
will no longer be equal. 

Remark 3: Estimation of Pk is robust as a consequence of the resample aggregation. 
Like in bagging, the influence of outliers is reduced because they are not sampled into 
all resample models. Unlike deterministic optimization procedures that are exposed to 
outlier influence in each convergence step, outliers can only influence some steps during 
the stochastic convergence of the Pk matr ix. Resample aggregat ion can be interpreted as 



a stochastic version of the EM-algorithm ([Dempster et all 119771 ): estimating the missing 



class labels c from C is clearly an e-step. Matching the next resample solution c to the 
current best estimate c is a maximization step. Unlike classical EM, this is not an m-step 
maximizing the full model but only an optimized voting improving the model stochastically. 
Details on convergence will be provided elsewhere. 

Remark 4: For the variation in cluster solutions, we focused on case sampling since this 
gives rise to clear s tatistical i nterp retation of the probabilitites in P^-. However, analogous 



to random forests (Brciman, 2001 ), it s hould be possible to extend truecluster to attribute 



sampling, cf. IStrehl and Ghoshl (120021 ) . It is obvious that whatever sampling scheme is 
used, it must be equal across all K in order to obtain comparable CICk- 

Remark 5: MMCC might be seen as a special case of creating and aggregating a cluster 
ensemble. Because truecluster focuses on identifying the best number of clusters K rather 
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than creating consensus across different K or across different base cluster algorithms, en- 
sembles can be very large and there are as many ensembles as candidates for the best 
K which can easi l y exh aust computer memory. For example, the methods suggested by 



Strehl and Ghoshl (|2002l ) would require holding all cluster solutions of all resamples simul- 



taneously in the memory. Implications for software architecture will be provided elsewhere. 

All the calculations in this paper have used resampling with replacement, prediction, 
and the truematch heuristic that is obtained with matchindex(method="heuristic") 
(jpehlschlagell . l2007ah . 



5. Cluster information criterion (CIC) 

After having introduced the truecluster framework, this paper focuses on the evaluation of 
the results of the MMCC aggregated voting: how to condense the information in P to a 
single value, guiding selection of optimal K: the cluster information criterion (CIC). 

Predictive class modeling can be described as declaring the existence of K classes with 
i.i.d. cases and it is clear that more classes result in more homogeneous distributions and 
higher log-likelihood. To cite Gideon Schwarz: "In such cases the maximum likelihood 
principle invariably leads to choosing the highest possible dimension. Therefore it cannot 
be the right formalization of the intuitive notion of choosing the 'right' dimension" (1978, p. 
461). Thus, Akaike's AIC and Schwarz' BIC penalize model certainty (the log-likelihood) 
by model complexity (the degrees of freedom). For the resample aggregation matrix P, the 
contrary is true: more complex models are usually — with some exceptions — less stable. In 
this case, the maximum certainty principle invariably leads to choosing the lowest possible 
number of clusters. Therefore, it cannot be the right formalization of the intuitive notion 
of choosing the right number of clusters. It is important to realize that this limitation of 
certainty-only approaches extends to all stability-only approaches, including those that look 
for non-random stability. Following the logic to correct the shortcoming of a certainty-only 
approach, we suggest rewarding the model certainty for the model complexity. Therefore, 
we define the cluster information criterion of the ET-cluster model as 



CIC = model certainty + model in formation . . 

= model information — model uncertainty 1 

a trad e -off b etween model information and model uncertainty, both measured in bits 
( Shannon . 19481 ). The CIC trades off information against uncertainty in evaluating the 



combination of the base cluster algorithm and prediction method. Given a fixed base 
cluster algorithm, prediction method, and resample size, the CIC can be used for objective 
automatic model selection — without the need to specify a null reference distribution. In the 
following section, we derive the calculation of model information, model uncertainty, and 
further diagnostics. 

Imagine a system with K states and let's begin with the simplifying assumption that 
all states are equally likely. Let's assume we don't know the system's actual state and call 
this our uncertainty and let's measure this in bits 



uncertaintyK = log2K 
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Let's define information as the reduction of uncertainty when we get to know the actual 
state k £ {1...K} 

informationK = log2K 

So for a system with K = 4 states, our uncertainty is log2 (4) = 2 bit and we can gain 
2 bit of information. Now let's generalize and introduce different probabilities pk for our 
states. It is obvious that we don't have any uncertainty if p = 1 for one state and p = 
for the other states. Uncertainty is maximal if all states have equal probability p = 1/K . 
The amount of uncertainty (or gainable information) of such a probabilistic system can be 
quantified as its entropy 

K 

entropy = - ^ Pklog 2 Pk (2) 

k=l 

It is instructive to note that for equal probabilities p = 1/K this simplifies as it should 

to 

K 1 1 
entropy K = - ~]7 lo 92j7 = lo 92K 
k=i 

Now we can give our probabilistic system the interpretation of a random distribution and 
recognize that the entropy is the weighted average of —logiPk (weighted by state probabil- 
ity). In other words: entropy is the expected value of the information gained after randomly 
sampling one observation from our distribution. Equation [2] can be used to measure the 
model information of a cluster model declaring K clusters of a certain size and using crisp 
assignments of cases to clusters. 

The classical measure of model certainty in predictive modeling is the log-likelihood, 
representing the probability of the data observed given the model. Applying this logic to 
our P matrix we get 

N 

pseudo log2likelihood = log2pi jC (3) 

i=l 

where Pj jC denotes for each case i the probability of the most frequently voted cluster. 
Two things are wrong with Equation [3j First, it's not really a likelihood because the 
observations are not independent with respect to their cluster assignments; second, no crisp 
cluster memberships c have been observed. Instead, our model P states probabilities Pj >t 
estimating how likely it is that case i belongs to each of the clusters. Therefore, we generalize 
Equation [3] to the non-crisp case. Following Equation [2] we replace per case the value of 
log2pi, c in Equation [3] through the expected value Ylk=i Pi,kl°9iPi,k ( or zero if A,c = 1) 
across all clusters 

. N K 

model uncertainty = — — Pi,k^og2pi y k (4) 

i=l fc=i 
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In this definition of model uncertainty, we have switched the sign and additionally di- 
vided it by the sample size iV in order to make our measure independent of sample size. 
For a crisp matrix with all cluster member probabilities Pj jC = 1 and all other Pj . = 0, 
Equation 0] reduces to Equation [3l but generally our model u ncertainty evalu ates all cells 
of P. Equation H] can be interpreted as a conditional entropy ( MacKay . 20031 ) of clusters, 



given the cases. In the co ntext of fuzzy clustering, Equation 0] is known as partition en- 
tropy (jBezdek et all Il980h . which — without further correction — is known to depend on the 



number of clusters. 

Now being equipped with a definition of model uncertainty, we can easily show — 
following Schwarz — that the uncertainty alone is not sufficient to evaluate cluster models. 
The following three example matrices P all have the same uncertainty (and log-likelihood) — 
zero — but they obviously differ in model information: 

10 10 10 

10 10 10 

10 10 10 

10 10 10 

01 10 0100 

01 10 0100 

1 10 10 

0110 0100 
01^01^0010 

1 1 10 

1 1 10 

1 1 10 

1 1 1 

1 1 1 

1 1 1 

1 1 1 

The first matrix has two classes with probabilities { j, |} and, thus, entropy = —\log2\ — 
|Zog2§ — 0.8, the second has entropy = —^log%^ — hlog2^ = 1 and, the third has 
entropy = —A^log2j = 2. Clearly the amount of information delivered by these models 
is different — the last one being the most informative. The reader will have noted that this 
was a crisp example. Analogous to the model uncertainty, we now generalize from the 
information of a crisp cluster membership vector to the non-crisp case: we want to avoid 
information loss resulting from considering marginals only. The marginal cluster probabili- 
ties pk in Equation [2] can be interpreted as the column-means of (a crisp) P. This suggests 
actually defining p^ as the column-means of the non-crisp P and to focus on the amount 
of information gained on average by knowing Pj.. when randomly sampling one case: we 
define D as the conditional information of cases, given the clusters, 

T N P u 

£>i t k = -Pi,klog 2 (l - \Pi,k - Pk\) (6) 
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which evaluates the difference between the case probabilities P^. and the average column 
probabilities pk- It can be easily seen that for crisp cluster assignments Equation [6] reduces 
to Equation [2j 

Measures such as Equations [2] and [6] signal too much information for over-complex 
models. For a model that assigns each case into its own private cluster, we formally get 
entropy = log2N, where in fact the model does not deliver any information at all. Therefore, 
we need to penalize Di^ for model complexity. Simply penalizing for the degrees of freedom 
K by scaling down with (1 — Jx) implicitly assumes equal cluster sizes. As a more general 
measure for model complexity than |£, we suggest the relative model complexity RMC 

2&k=i Pk l °92Pk) _ 1 

RMC = _ (7) 

which takes values between (no model complexity) and 1 (maximum model complex- 
ity). We now can define the cell- wise model information Ij & by penalizing Di k for RMC 
(Equation [8|). By summing rows and averaging over columns of I, we obtain the model 
information (Equation [9]) that we need in order to estimate the CIC in Equation [TJ 

4k = A,fe • (1 - RMC) (8) 

- K N 

model information = — ■ 4fc (9) 

fc=l i=l 

For diagnostic purposes, we might want to express the CIC as a function of cell-wise 
components ClC^k 

cic = jj ■ YliLi Ylk=i cic i: k 

where 

cic itk ii,k-{-Pi,kiog 2 P hk ) ( 10 ) 

}° 92 (i-\P^ k -p k \)( i - RMC -> . 



The model information in Equation [9] quantifies how much the model tells us, on average, 
about a case randomly drawn from the sample (when replacing the marginal p k by the case- 
specific row Pi.). The model uncertainty (Equation S|) quantifies how much uncertainty, 
on average, the case-specific model claims involve. The CIC finally is the expected value 
(across cases) of the case-wise expected value of the amount of information delivered by 
a ratio that becomes big, if cluster k has high probability in case i, without having high 
probability in general (and without beeing penalized for over-complexity). Analyzing CJCj & 
over cases or clusters can give valuable diagnostic insights. Finally, as an easy-to- interpret 
case-wise diagnostic, we suggest the generalized silhouette diagnostic (GSD) 



p. 

1 l,C 



GSDi = ^ 2-1 (11) 

Pi,c + Pi,c2 

ranging from (ambiguous assignment) to 1 (unambiguous assignment) where Pi fi i is 
the estimated probability for the second best cluster pe r case. Func t ions fo r calculating 
CIC and GSD are available in R package truecluster (|Oehlschlagell . l2007ah . see ?CIC. 
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6. Scalability 

CIC model comparison is a computationally intensive method. Comparing 10 cluster mod- 
els with 1,000 resamples requires 10,000 applications of the base cluster algorithm. This 
appears to be expensive but this standardized method is cheaper than ad-hoc manual model 
comparison; more importantly, it is scalable to big samples: the critical scalability compo- 
nent of resample aggregation is the base cluster algorithm. If it is scalable, truecluster is 
scalable as well. If the base cluster algorithm scales badly (takes too long or too much 
computer memory to handle N cases), truecluster still allows fitting the model to the full 
dataset if the following assumptions are met: the base cluster algorithm scales to n cases, 
this size of resample is sufficient to catch the complexity of the true cluster pattern, and 
the prediction method scales sufficiently. In very large samples with n <C N, the critical 
component of the resample aggregation is the prediction method which is needed to classify 
those cases not in the resample. If no specific scalable prediction method is available, we 
can always resort to 1st nearest neighbor prediction. Determining the nearest neighbor 
of each datapoint is needed only once and can then be used for each resample prediction 
and for each K. Naive nearest neigh bor iden t ificati on has time complexity 0(iV 2 ). Unless 
dimensionality is too high, kd-trees ( Bentlev . 19751 ) can speed up neares t neighbor identi 



fication, especially when exploiting the 'all nearest neighbor' situation (jGray and Moore) . 



200d ). When following the MMCC Remark 2, no prediction is needed and voting is done 



in batches which actually scales the base cluster algorithm close to O(N). Thus, truecluster 
scales arbitrary base cluster algorithms to large datasets depending on the scalability of the 
base cluster algorithm with O(N) (or an upper bound of 0(N 2 )) for time complexity and 
space complexity 0(N • K) (for one cluster solution given K). Truecluster computations 
can be accelerated by distributing subtasks on parallel computing nodes. When comparing 
several models, each can be fitted on a separate node. Furthermore, in fitting each P^, 
computations for r resamples can be distributed across r separate nodes. 



7. Example 



Mahon ( Campbell and Mahon . 19741 ) recorded data on 200 specimens of Leptograpsus var- 
iegatus crabs on the shore in Western Australia. This occurs in two colour forms, blue and 
orange, and he collected 50 of each form of each sex and made five physical measurements. 
These were carapace (shell) length CL and width CW, the size of the frontal lobe FL and 
rear width RW, and the body depth BD. The latter was measured somewhat differently 
for m ales and females. This dat aset has frequently b e en re- analyzed (jVenables and Riplevl . 
19941 ) and is publicly available (jVenables and Riplevl . 12002 ) . 

While the original analysis asks whether there are two morphologically distinct species 
or not, as an illustrative example we ask here whether a cluster algorithm will detect the 
four true classes in the data. We choose the measurement with the largest scale (CW) 
as an indicator of individual crab size and express the other four measurements by their 
ratio to CW, instead of their absolute size. CW itself is sufficiently symmetric so we do 
not transform CW to log scale. These measurements are then sphered using principal 
component analysis (using the correlation matrix) to define the cluster space. The data is 
shown as a scatterplot matrix in Figure [TJ 
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Figure 1: Crab data in principal component space 

(B,b,0,o code blue and orange, males and females) 



Looking for convex clusters, we choose partitioning around medoids (PAM) ([Kaufman and Rousseeuw . 
as our base cluster algorithm and as a prediction method we assign out-of-resample 



cases to the closest medoid (in euclidean space). W e begin the analysis by applying stan- 
dard PAM for 2 to 10 clusters as implemented in R (jfiousseuw et all 120041 ). Table Q] shows 
the arithmetic means of the GOF silhouette widths. Following this criterion, the 5-cluster 
solution appears to be best. The table also shows the information, uncertainty, and CIC 
from the respective truecluster models based on 1000 bootstrap samples (n=200) or on 1000 
subsamples of size n=100 (drawn with replacement). According to both truecluster models, 
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a 5-cluster solution is rejected: the 2-cluster solution has the lowest uncertainty and the 
4-cluster solution has — correctly — the best CIC. 

Table [2] shows the agreement of the cluster solutions with the true classes: truecluster 
provides robust optimized solutions that show excellent agreement (and better agreement 
with true classes than standard PAM solutions). 

The truecluster bootstrap 200 and subsample 100 solutions disagree only in one case, 
the case with the lowest generalized silhouette diagnostic. GSDs of truecluster 200 and 100 
correlate with r=0.986. Comparing standard PAM versus truecluster (200), we find that 
with respect to species, 143 cases are correctly classified by both cluster methods and in 27 
cases both methods fail (Table [3]). Of the remaining 30 cases, standard PAM fails in 29 and 
truecluster only in 1 case. Looking for disagreement with respect to species and gender, we 
find 22 cases, of which standard PAM fails in 17 and truecluster only in 5 cases. 





standard 


truecluster 200 


100 


cluster^ 


silhouette 


information 


uncertainty 


CIC 


CIC 


2 


0.131 


0.406 


0.736 


-0.330 


-0.404 


3 


0.182 


0.859 


0.738 


0.121 


-0.267 


4 


0.209 


1.012 


0.736 


0.199 


-0.119 


5 


0.225 


1.055 


0.738 


0.108 


-0.196 


6 


0.217 


1.042 


0.813 


-0.025 


* 


7 


0.185 


1.058 


1.129 


-0.071 


* 


8 


0.204 


1.050 


1.186 


-0.136 


-0.619 


9 


0.214 


0.983 


1.351 


-0.368 


-0.811 


10 


0.216 


0.981 


1.386 


-0.405 


* 



Table 1: Evaluation of PAM cluster models (* degenerated to fewer clusters) 





2 clusters x 2 species 


4 clusters x 4 species/gender 




standard 


100 


200 


standard 


100 


200 


fraction matched* 


0.720 


0.920 


0.860 


0.845 


0.900 


0.905 


kappa* (Cohen. I960) 


0.440 


0.840 


0.720 


0.793 


0.867 


0.873 


rand (Rand. 1971) 


0.595 


0.852 


0.758 


0.873 


0.909 


0.912 


crand (Hubert and Arabie. 1985) 


0.190 


0.704 


0.516 


0.663 


0.755 


0.765 



Table 2: Agreement of standard and truecluster PAM with true classes 
(* after matching clusters) 



Figure [2] plots the silhouette values against the truecluster GSDs and shows the localiza- 
tion of the failures with respect to these diagnostics. While standard PAM failures are not 
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code 


2 species 


4 species/gender 


both OK 


o 


143 


164 


standard PAM fails 


s 


29 


17 


truecluster PAM fails 


t 


1 


5 


both fail 


X 


27 


14 


TOTAL 




200 


200 



Table 3: Failures to assign true classes 



associated with low silhouette values, at least all pure truecluster failures have low GSDs, 
which gives some confidence that GSDs are useful. 
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Figure 2: Failures to assign true species /gender classes 

(s = standard PAM fails, t = truecluster fails, x = both fail, o = both ok) 



To check convergence, we repeated the truecluster bootstrap procedures 100 times and 
monitored how the CIC results stabilized as we aggregate more and more resamples (Figure 
[3]). For the truecluster bootstrap procedure, 98% of the repetitions favoured the 4-cluster 
solution after aggregating 1000 resamples. The truecluster procedure with subsamples con- 
verged faster and reached 100% decisions for the 4-cluster solutions after aggregating 550 
or more resamples. 

So far we have seen that truecluster was able to identify a plausible best (4-cluster) model 
and that the estimated cluster memberships agreed quite well with the true classes. Usually 
in cluster analysis we don't know the true classes: we are looking for them. After having 
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identified a best model in the sample, we still don't know whether the clusters really exist in 
the population. Resample aggregation does not make sense for the 1-cluster solution, CIC is 
by definition. A positive CIC of any K > 1 solution is an indication of non-randomness 
of that solution, but we cannot assume that the border between randomness and non- 
randomness is exactly at CIC = 0. The CIC is not only a function of the data but also of 
the resampling scheme, the base cluster algorithm, and the prediction method. In order to 
check for a non-random pattern using simulation validation, we need the assumption of a 
reference null distribution. We simulated 1001 successive random samples (n=100) from a 
multivariate normal distribution with a variance-covariance structure like the original data, 
then fitted for each a PAM-4 model, and calculated the rand agreement for each of the 1000 
successive pairs of cluster solutions (Figure HI black distribution) . Similarly, we calculated 
1000 rand values from 1001 PAM-4 models on 1001 subsamples (n=100) of the original 
data (red distribution). Resampling here clearly resulted in higher agreement compared 
to simulation from a null distribution and the degree of non-overlap between these two 
distributions is a strong indication that we identified a non-random clustering. 

While the red distribution evaluates agreement between pairs of resamples, the blue 
distribution in Figure [4] shows the agreements between resample solutions and the standard 
sample solution: this is expected to have higher agreements because the standard sample 
solution is based on the full sample, unlike the solutions from the subsamples which typically 
include only about 40% of the sample cases here. This expectation turns out to be true, 
but note that the green distribution shows even higher agreement between the truecluster 
solution and the resamples: a clear indication that aggregation of subsamples using only 
40% of the data gives better results than the standard solution based on the full sample. 
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Figure 4: Simulation validation and model checking 



In summary, the crab example confirms that it is possible to identify a best model based 
on aggregating resamples without the need for a reference null distribution, only evaluation 
of non-randomness requires such an assumption. Truecluster has identified the best (4- 
cluster) model, the estimated cluster memberships agreed quite well with the true classes, 
and the case-wise diagnostics have proven useful. Appendix A shows truecluster results for 
some illustrative artificial datasets. 
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8. Discussion 

We have presented a statistical framework for robust scalable clustering with model se- 
lection for the optimal number of clusters. It assumes that the data has been sampled 
randomly from an infinite population and mimics this sampling in order to estimate cluster 
models and to evaluate their quality and stability. Truecluster works with arbitrary defini- 
tions of c luster space and clu steriness and, for example, can be applied to symmetry-based 
k-mcans dSu and Choul . l200lh . The benefits of the truecluster approach are: robust cluster 



assignments, useful case-wise diagnostics, and a unified framework (also allowing for a uni- 
fied software interface) to select the best number of clusters. Using subsampling instead of 
bootstrapping can help to scale 'expensive' base algorithms to large samples and subsam- 
pling does not automatically reduce solution quality: reducing the resample size turns the 
base cluster algorithm more towards a 'simple base learner' leading to more robust solu- 
tions and faster convergence. The computational burden of truecluster may appear high, 
but it is an economic alternative to expensive manual validation. Instead of a separate 
evaluation of quality and stability which is computationally expensive as well, truecluster 
does an integrated evaluation of quality aspects and stability based on generally applicable 
information-theoretical criteria. 

The benefits of truecluster come with two limitations that deserve further research: 1) 
truecluster gives us a best model for the sample — without relying on the assumption of 
a reference null distribution — but it does not guarantee that the cluster pattern chosen is 
non-random and we still might over- or underfit the pattern existing in the population; 
2) comparison of CIC across base cluster algorithms requires very cautious interpretation: 
base cluster algorithms usually differ in model flexibility with respect to the original data, for 
example, take convex clusters versus arbitrarily shaped clusters. While the model certainty 
part of the CIC penalizes such greater model flexibility, the model information part of the 
CIC does not reward it. Consequently, CIC will favour less flexible base cluster algorithms 
over more flexible ones. This is not a problem as long as we know which kind of model 
flexibility we need and, therefore, choose an appropriate model class. Model comparisons 
across models d iffering in flexibility, to our understanding requires full Bayesian modeling 
dMacKavl . l2003h of the original data: solving this generically in software is an enterprise 



which is much more complicated than the task we have addressed here. 
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Appendix A. 

In this appendix we present truecluster res ults of artific i al exam ples with known cluster 



The colour of each datapoint represents the cluster to which truecluster has assigned it. 
The reliability of the assignment (generalized silhouette diagnostic) is shown colour-coded 
in the center of each datapoint. Cases with certain assignment are filled in black, cases with 
uncertain assignment are filled in white, and cases in between in grey. 

Figures [5] and [6] give examples of convex shaped clusters. The truecluster version of 
partitioning around medoids correctly identifies the correct number of clusters and correctly 
assigns the cluster memberships. 

Figures [7] through [10] show examples of well-separated arbitrary-shaped clusters. The 
truecluster version of single link agglomeration correctly identifies the correct number of 
clusters and correctly assigns the cluster memberships. 



structure. The datasets are available online (jQehlschlagell . 12007; 
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Figure 5: Truecluster correctly recognizes the 4 clusters, not only the 2 groups (ma- 
genta/blue versus red/green). Cases on the border between magenta/blue or 
red/green are marked (white) as uncertain assignments. 
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Figure 6: Truecluster correctly recognizes the 4 clusters, although they are not well- 
separated. Cases on the border between the clusters are marked (white) as un- 
certain assignments. 
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Figure 7: Truecluster correctly recognizes 2 elongated clusters 
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Figure 8: Truecluster correctly recognizes the 3 differently shaped clusters. 

This data is similar to an example from SAS Institute showing that their density- 
based MODECLUS can detect the 3 clusters if parameters are chosen correctly. 
Truecluster does not require the correct choice of such parameters. 
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Figure 9: Truecluster correctly recognizes 2 clusters, although one is topologically enclosed 
by the other. 
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